Based on the Data Carpentry curriculum for Data Visualization in Python (© Data Carpentry, under a Creative Commons Attribution license)
Python is a general-purpose programming language that supports rapid development of scripts and applications.
Python’s main advantages:
The following instructions will show you how to download and open this Jupyter notebook interactively, with the notes and exercises for the introductory workshop.
- Unzip the notebook folder once it has downloaded
- Download Anaconda from https://www.anaconda.com/download/ (select your operating system, then click 'Download')
- Install Anaconda, then open JupyterLab (click 'Launch' in Anaconda Navigator)
- In JupyterLab, navigate to where you downloaded the notebook folder
- Double click on "Basic Pandas.ipynb"
- The notebook should now load!
This Jupyter notebook has been built to run on a Python 3 'kernel'. That means you can enter Python code into the code cells, and it will execute as Python.
To execute code, select the cell(s), then either:
* Choose "Run Selected Cells" from the Command tab on the left
* Click the run button (triangle) at the top of the notebook
* Press "Shift + Enter"
Try to execute the code in the cell below:
In [ ]:
# Like this first line, anything following a hash character (for the rest of that line) is considered a comment, and won't be run as code
text_str = "Congratulations, you've just run some Python code!"
print(text_str)
You will notice that the output of the cell is placed immediately underneath the cell, and that a number appears to the left of the cell to indicate the order in which the cell was run:
In [1]:
You can modify and execute each cell in this notebook as often as you wish, in any order that you wish. Any variables you create in one cell are available for use in another. For example:
In [ ]:
print(text_str)
In [ ]:
text = "Weill Cornell Medicine" # An example of a string
number = 42 # An example of an integer
pi_value = 3.1415 # An example of a float
Here we've assigned data to the variables text, number and pi_value, using the assignment operator =. To review the value of a variable, we can type the name of the variable into the cell and press Shift + Enter:
In [ ]:
text
Everything in Python has a type. To get the type of something, we can pass it to the built-in function type:
In [ ]:
type(text)
In [ ]:
type(number)
In [ ]:
type(6.02)
The variable text is of type str, short for "string". Strings hold sequences of characters, which can be letters, numbers, punctuation or more exotic forms of text (even emoji!).
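Strings also support a handful of handy operations. The cell below is an illustrative sketch that is not part of the original notebook; it reuses the text variable defined earlier:
In [ ]:
# Strings can be joined with + and indexed with square brackets (0-based)
greeting = "Hello, " + text
print(greeting[0])    # 'H', the first character
print(len(greeting))  # the number of characters in the string
Python can also be used as a calculator, using the standard arithmetic operators: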
In [ ]:
2 + 2 # Addition
In [ ]:
6 * 7 # Multiplication
In [ ]:
2 ** 16 # Power
In [ ]:
13 % 5 # Modulo
We can also use the comparison operators <, >, ==, !=, <=, >=, as well as the logical operators and, or, not. These expressions return a value of type bool, called a boolean, which is either True or False.
In [ ]:
3 > 4
In [ ]:
True and True
In [ ]:
True or False
In [ ]:
# Lists hold an ordered sequence of elements; indexing starts at 0
numbers = [1, 2, 3]
numbers[0]  # the first element
A for loop can be used to access the elements in a list or other Python data structure one at a time:
In [ ]:
for num in numbers:
    print(num)
Indentation is very important in Python. Note that the second line in the example above is indented. You can set an indent either by pressing the tab key, or by entering four spaces (the usual Python convention).
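Indentation is how Python knows which lines belong to a block. As a quick illustrative sketch (not part of the original notebook), the same rule applies to conditional statements:
In [ ]:
# The indented lines belong to the if/else branches
for num in numbers:
    if num > 1:
        print(num, "is greater than 1")
    else:
        print(num, "is not greater than 1")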
To find out what methods are available for an object in JupyterLab, press Shift + Tab when the cursor is at the end of the object (or inside the parentheses, for a function).
In [ ]:
# Tuples use parentheses
a_tuple = (1, 2, 3)
another_tuple = ('blue', 'green', 'red')
# Note: lists use square brackets
a_list = [1, 2, 3]
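The key difference between the two is that lists can be changed after they are created, whereas tuples cannot. A minimal sketch (not part of the original challenge cells below) of that difference:
In [ ]:
# Lists are mutable: we can replace an element in place
a_list[0] = 5
print(a_list)        # [5, 2, 3]

# Tuples are immutable: the line below would raise a TypeError if uncommented
# a_tuple[0] = 5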
In [ ]:
# 1:
In [ ]:
# 2:
In [ ]:
translation = {'one': 1, 'two': 2}
translation['one']
Dictionaries work a lot like lists - except that you index them with keys. You can think of a key as a name or unique identifier for a value in the dictionary. Keys can only be of particular types - they have to be "hashable". Strings and numeric types are acceptable, but lists aren't.
In [ ]:
rev = {1: 'one', 2: 'two'}
rev[1]
In [ ]:
# This raises a TypeError: lists are not hashable, so they cannot be used as keys
bad = {[1, 2, 3]: 3}
In Python, a "Traceback" is a multi-line error block printed out for the user.
To add an item to the dictionary we assign a value to a new key:
In [ ]:
rev = {1: 'one', 2: 'two'}
rev[3] = 'three'
rev
Using for loops with dictionaries is a little more complicated. We can do this in two ways:
In [ ]:
for key, value in rev.items():
    print(key, '->', value)
or
In [ ]:
for key in rev.keys():
    print(key, '->', rev[key])
Challenge - Can you do reassignment in a dictionary?
- First create a dictionary challenge_dict with the following key:value pairs:
  - 1 : 'uno'
  - 2 : 'dos'
  - 3 : 'tres'
- Display the keys and values in the dictionary
- Try to reassign the second value (in the key:value pair) so that it no longer reads "dos" but instead reads "apple-sauce"
- Now display your dictionary again to see if it has changed
In [ ]:
# 1
challenge_dict =
In [ ]:
# 2
In [ ]:
# 3
In [ ]:
# 4
It is important to note that, in older versions of Python (before 3.7), dictionaries were "unordered" and did not remember the sequence in which key:value pairs were added, so the order in which items came back from a loop could appear random. From Python 3.7 onwards, dictionaries preserve insertion order, but it is still best to think of them as lookups by key rather than as ordered sequences.
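A quick sketch (not part of the original notebook) illustrating this behaviour on Python 3.7 or later:
In [ ]:
# On Python 3.7+, keys come back in the order they were inserted
d = {'b': 2, 'a': 1}
d['c'] = 3       # new keys are appended at the end
list(d.keys())   # ['b', 'a', 'c']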
We can automate the process of performing data manipulations in Python. It's efficient to spend time building the code to perform these tasks because once it's built, we can use it over and over on different datasets that use a similar format. This makes our methods easily reproducible. We can also easily share our code with colleagues and they can replicate the same analysis.
For this lesson, we will be using NCD Risk Factor Collaboration (NCD-RisC) data, from Worldwide trends in body-mass index, underweight, overweight, and obesity from 1975 to 2016: a pooled analysis of 2416 population-based measurement studies in 128.9 million children, adolescents, and adults.
Lancet 2017, published online 11 October 2017
We will be using the country-specific adult BMI data. The downloaded filename is NCD_RisC_Lancet_2017_BMI_age_standardised_country.csv, but we have saved a simplified version of the file here as NCD_RisC_bmi.csv.
We are studying the body-mass index (BMI) of individuals from 200 countries over a number of years. The dataset is stored as a .csv file: each row holds information for a single measurement period, and the columns represent:
Column | Description |
---|---|
country | Country |
iso | Country code |
sex | Gender of the group |
year | Year measurements were taken |
bmi | Mean BMI |
lower95ci | Lower bound of the 95% confidence interval for mean BMI |
upper95ci | Upper bound of the 95% confidence interval for mean BMI |
obese_pc | Prevalence of obesity (BMI > 30) |
obese_lower95ci | Lower bound of the 95% confidence interval for obesity prevalence |
obese_upper95ci | Upper bound of the 95% confidence interval for obesity prevalence |
The first few rows of our first file look like this:
country,iso,sex,year,bmi,lower95ci,upper95ci,obese_pc,obese_lower95ci,obese_upper95ci
Afghanistan,AFG,Men,1975,18.9994401470858,16.3311193806968,21.7630768066895,0.00207754471275783,0.000383525249039564,0.00651548709746884
Afghanistan,AFG,Men,1976,19.1055182251737,16.5000772402149,21.8011089747524,0.00223682902494332,0.000437170207444467,0.0068203531416254105
Afghanistan,AFG,Men,1977,19.2120802193391,16.6465068023822,21.8420206008889,0.0024101143094182,0.000497358014754166,0.00717898800855567
Afghanistan,AFG,Men,1978,19.3190525301356,16.7868986048326,21.8765159552146,0.00259867365703349,0.000558764696829654,0.007567109958970071
A library in Python contains a set of tools (called functions) that perform tasks on our data. Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench for use in a project. Once a library is set up, it can be used or called to perform many tasks.
One of the best options for working with tabular data in Python is to use the Python Data Analysis Library (a.k.a. Pandas). The Pandas library provides data structures, produces high quality plots with matplotlib and integrates nicely with other libraries that use NumPy (which is another Python library) arrays.
Python doesn't load all of the libraries available to it by default. We have to add an import statement to our code in order to use library functions. To import a library, we use the syntax import libraryName. If we want to give the library a nickname to shorten the command, we can add as nickNameHere. An example of importing the pandas library using the common nickname pd is below.
In [ ]:
import pandas as pd
Each time we call a function that's in a library, we use the syntax LibraryName.FunctionName. Adding the library name with a . before the function name tells Python where to find the function. In the example above, we have imported Pandas as pd. This means we don't have to type out pandas each time we call a Pandas function.
We will begin by locating and reading our survey data, which are in CSV format. CSV stands for Comma-Separated Values and is a common way to store formatted data. Other separators may also be used, so you might see tab-separated, colon-separated or space-separated files. It is quite easy to replace one separator with another to match your application. The first line in the file often has headers to explain what is in each column. CSV (and other separator) formats make it easy to share data, and can be imported and exported from many applications, including Microsoft Excel. For more details on CSV files, see the Data Organisation in Spreadsheets lesson.
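For instance, here is a minimal sketch (the tab-separated filename is hypothetical and not part of this lesson's data) of telling Pandas which separator to expect:
In [ ]:
# sep="\t" tells read_csv to split columns on tabs instead of commas
# "data/example_data.tsv" is a hypothetical filename, used only for illustration
pd.read_csv("data/example_data.tsv", sep="\t")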
We can use Pandas' read_csv function to pull the file directly into a DataFrame.
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, factors and more) in columns. It is similar to a spreadsheet, an SQL table, or the data.frame in R. A DataFrame always has an index (0-based). An index refers to the position of an element in the data structure.
In [ ]:
# Note that pd.read_csv is used because we imported pandas as pd
pd.read_csv("data/NCD_RisC_bmi.csv")
We can see that there were 16,800 rows parsed. Each row has 10 columns. The left-most column displayed is the index of the DataFrame. The index is used to identify the position of the data, but it is not an actual column of the DataFrame.
It looks like the read_csv function in Pandas read our file properly. However, we haven't saved any data to memory, so we can't work with it yet. We need to assign the DataFrame to a variable. Remember that a variable is a name for a value, such as x or data. We can create a new object with a variable name by assigning a value to it using =.
Let's call the imported survey data df_surveys:
In [ ]:
df_surveys = pd.read_csv("data/NCD_RisC_bmi.csv")
Notice that when you assign the imported DataFrame to a variable, Python does not produce any output on the screen. We can view the value of the df_surveys object by typing its name into a code cell and running it.
In [ ]:
df_surveys
This prints the contents of the DataFrame, just like above.
Note: if the output is too wide to print in your window, pandas will truncate the display, showing only the first and last few rows and columns, with ellipses (...) in between.
Never fear, all the data is still there. Selecting just a few rows, so it is easier to fit on one window, you can see that pandas has neatly formatted the data to fit our screen:
In [ ]:
df_surveys.head() # The head() method displays the first few rows of the DataFrame. It is discussed below.
In [ ]:
type(df_surveys)
As expected, it's a DataFrame (or, to use the full name that Python uses to refer to it internally, a pandas.core.frame.DataFrame).
What kind of things does df_surveys contain? DataFrames have an attribute called dtypes that answers this:
In [ ]:
df_surveys.dtypes
All the values in a column have the same type. For example, year has type int64, which is a kind of integer. Cells in the year column cannot have fractional values, but the bmi and obese_pc columns can, because they have type float64. The object type doesn't have a very helpful name, but in this case it represents strings (such as 'Men' and 'Women' in the case of sex).
We'll talk a bit more about what the different formats mean in a different lesson.
There are many ways to summarize and access the data stored in DataFrames, using attributes and methods provided by the DataFrame object.
To access an attribute, use the DataFrame object name followed by the attribute name: df_object.attribute. Using the DataFrame df_surveys and the attribute columns, an index of all the column names in the DataFrame can be accessed with df_surveys.columns.
In [ ]:
df_surveys.columns
Methods are called in a similar fashion, using the syntax df_object.method(). As an example, df_surveys.head() gets the first few rows in the DataFrame df_surveys using the head() method. With a method, we can supply extra information in the parentheses to control behaviour.
Let's look at the data using these.
Challenge - DataFrames
Using our DataFrame df_surveys, try out the attributes & methods below to see what they return.
- df_surveys.columns
- df_surveys.shape. Take note of the output of shape: what format does it return the shape of the DataFrame in? HINT: think back to tuples, introduced above.
- df_surveys.head(). Also, what does df_surveys.head(15) do?
- df_surveys.tail()
We've read our data into Python. Next, let's perform some quick summary statistics to learn more about the data that we're working with. We can perform summary stats quickly using groups. But first we need to figure out what we want to group by.
Let's begin by exploring our data:
In [ ]:
# Look at the column names
df_surveys.columns
Let's get a list of all the years. The pd.unique function tells us all of the unique values in the year column.
In [ ]:
pd.unique(df_surveys['year'])
We can get summary statistics for a single column, such as bmi, using the describe method:
In [ ]:
df_surveys['bmi'].describe()
We can also extract one specific metric if we wish:
In [ ]:
df_surveys['bmi'].min()
In [ ]:
df_surveys['bmi'].max()
In [ ]:
df_surveys['bmi'].mean()
In [ ]:
df_surveys['bmi'].std()
In [ ]:
df_surveys['bmi'].count()
But if we want to summarize by one or more variables, for example sex, we can use Pandas' .groupby method. Once we've created a groupby DataFrame, we can quickly calculate summary statistics by a group of our choice.
In [ ]:
# Group data by sex
grouped_data = df_surveys.groupby('sex')
The pandas function describe will return descriptive stats including: mean, median, max, min, std and count for a particular column in the data. Pandas' describe function will only return summary values for columns containing numeric data.
In [ ]:
# Summary statistics for all numeric columns by sex
grouped_data.describe()
In [ ]:
# Provide the mean for each numeric column by sex
# numeric_only=True skips non-numeric columns such as country and iso
grouped_data.mean(numeric_only=True)
The groupby command is powerful in that it allows us to quickly generate summary stats.
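For instance, an illustrative sketch (not part of the original lesson) computing several statistics at once for the bmi column with the .agg() method:
In [ ]:
# Calculate the mean, minimum and maximum bmi for each sex in one step
grouped_data['bmi'].agg(['mean', 'min', 'max'])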
Challenge - Summary Data
- How many recorded individuals are women, and how many are men?
- What happens when you group by two columns using the following syntax and then grab mean values?
grouped_data2 = df_surveys.groupby(['country','sex'])
grouped_data2.mean(numeric_only=True)
- Summarize bmi values for each country in your data. HINT: you can use the following syntax to only create summary statistics for one column in your data:
by_country['bmi'].describe()
In [ ]:
# Count the number of rows (years of measurements) for each country
year_counts = df_surveys.groupby('country')['year'].count()
print(year_counts)
Or, we can count just the rows for the United States of America:
In [ ]:
df_surveys.groupby('country')['year'].count()['United States of America']
We can also perform quick math on a column; the operation is applied element-wise to every row:
In [ ]:
# Multiply all bmi values by 2 (this creates a new Series; df_surveys itself is unchanged)
df_surveys_doubled = df_surveys['bmi'] * 2
In [ ]:
# Make sure figures appear inline in Jupyter Notebook
%matplotlib inline
In [ ]:
# Create a quick bar chart of mean bmi per year
year_bmi = df_surveys.groupby("year")["bmi"].mean()
year_bmi.plot(kind='bar');
We can also look at how many countries were measured in each year:
In [ ]:
total_count = df_surveys.groupby('year')['country'].nunique()
# Let's plot that too
total_count.plot(kind='bar');
(Perhaps a boring graph, but a good way to check that each year is consistent)
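As an aside (an illustrative sketch, not part of the original notebook), the appearance of these plots can be tweaked with keyword arguments such as figsize, which the next challenge makes use of:
In [ ]:
# figsize=(width, height) sets the figure size in inches
total_count.plot(kind='bar', figsize=(10, 4));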
Challenge - Plots
- Create a plot of average bmi across all years per country (to resize your graph, try using the keyword figsize=(width, height), where the values are in inches). Can you determine which country has the highest average bmi? Which has the lowest? To sort a DataFrame by the values in a particular column, you can use df.sort_values(column_name).
- Create a plot of average male versus average female obesity prevalence for the entire dataset.
In [ ]:
df_surveys.groupby("country")['bmi'].mean().sort_values()
Summary Plotting Challenge
Create a stacked bar plot, with obesity prevalence on the Y axis, and the stacked variable being sex. The plot should show mean obesity prevalence by sex for each year. Some tips are below to help you solve this challenge:
- For more on Pandas plots, see the Pandas plotting documentation.
- You can use the code that follows to create a stacked bar plot, but the data to stack need to be in individual columns. Here's a simple example with some data where 'a', 'b', and 'c' are the groups, and 'one' and 'two' are the subgroups.
In [ ]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pd.DataFrame(d)
We can plot the above with
In [ ]:
# Plot stacked data so columns 'one' and 'two' are stacked
my_df = pd.DataFrame(d)
my_df.plot(kind='bar', stacked=True, title="The title of my graph")
- You can use the .unstack() method to transform grouped data into separate columns for plotting. Try running .unstack() on some DataFrames above and see what it yields.
Start by transforming the grouped data (by year and sex) into an unstacked layout, then create a stacked plot.
In [ ]:
In [ ]:
In [ ]:
In [ ]:
First we group the data by year and by sex, and then calculate the mean obesity prevalence for each group.
In [ ]:
by_year_sex = df_surveys.groupby(['year','sex'])
year_sex_mean = by_year_sex['obese_pc'].mean()
This calculates the mean obesity prevalence for each sex within each year, as a table.
In [ ]:
year_sex_mean.head()
Below we'll use .unstack() on our grouped data to lay out the mean obesity prevalence for each sex as a separate column, with one row per year.
In [ ]:
spc = year_sex_mean.unstack()
spc
Now, create a stacked bar plot with that data, where the obesity prevalence for each sex is stacked by year.
Rather than displaying it as a table, we can plot the above data by stacking the values of each sex as follows:
In [ ]:
s_plot = spc.plot(kind='bar', stacked=True, title="Mean obesity prevalence by year and sex")
s_plot.set_ylabel("Prevalence of obesity")
s_plot.set_xlabel("Year")
In [ ]: